{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Postgres导入CSV中文编码错误\n",
    "\n",
    "现在公司数据资源非常丰富，动辄要处理的csv就1E8行纪录，10G，分析起来相当爽。编码格式是cp936，用官方的COPY命令导入Postgres，一直出错，目前COPY的ENCODING不支持cp936。试着用Pandas读几行，保持csv为utf-8再COPY没问题。\n",
    "\n",
    "解决办法：通过Python3的Pandas读取CSV转换成utf－8，再利用sqlalchemy＋psycopg2（目前只支持到Python3.4，还不支持3.5）导入即可。  \n",
    "\n",
    ">这么多依赖推荐使用[Anaconda](https://www.continuum.io/downloads)，十分方便，尤其适合解决在Windows平台编译C语言相关库（lxml，psycopg2，numpy，pandas）时的各种奇葩问题。\n",
    "\n",
    "具体方法如下：\n",
    "\n",
    "<!-- TEASER_END-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sqlalchemy import create_engine\n",
    "import ipython_memory_usage.ipython_memory_usage as imu"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "In [2] used 0.0195 MiB RAM in 0.12s, peaked 0.00 MiB above current, total RAM usage 61.92 MiB\n"
     ]
    }
   ],
   "source": [
    "imu.start_watching_memory()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "In [3] used 1.3125 MiB RAM in 0.16s, peaked 0.00 MiB above current, total RAM usage 63.23 MiB\n"
     ]
    }
   ],
   "source": [
    "chunksize = 1E6\n",
    "engine = create_engine('postgresql+psycopg2://postgres:tianKong7@localhost:5432/postgres')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "In [4] used 4.0273 MiB RAM in 0.12s, peaked 0.00 MiB above current, total RAM usage 67.26 MiB\n"
     ]
    }
   ],
   "source": [
    "df = pd.read_csv('c_全年库存结构变动_20151124122730.csv', encoding='cp936', chunksize=chunksize)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "for chunk in df:\n",
    "    chunk.to_sql(\"inventory\", engine, if_exists='append', chunksize=chunksize)\n",
    "    del chunk"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# CREATE TABLE inv (\n",
    "#     stat_date\tdate,\n",
    "#     store_src_cd\tint,\n",
    "#     sku_id\tint,\n",
    "#     stock_qty\tint,\n",
    "#     canuse_qty\tint,\n",
    "#     len\tint,\n",
    "#     width\tint,\n",
    "#     height\tint,\n",
    "#     item_first_cate_name\tvarchar(80),\n",
    "#     item_second_cate_name\tvarchar(80),\n",
    "#     item_third_cate_name\tvarchar(80)\n",
    "# );\n",
    "\n",
    "# COPY inv FROM 'D:\\\\reader\\\\dask-tutorial\\\\top200.csv' DELIMITER ',' CSV;"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
